Construction and Refinement of Coarse-Grained Models 
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A general scheme, which includes constructions of coarse-grained (CG) models, weighted ensemble 
dynamics (WED) simulations and cluster analyses (CA) of stable states, is presented to detect 
dynamical and thermodynamical properties in complex systems. In the scheme, CG models are 
efficiently and accurately optimized based on a directed distance from original to CG systems, which 
is estimated from ensemble means of lots of independent observable in two systems. Furthermore, 
WED independently generates multiple short molecular dynamics trajectories in original systems. 
The initial conformations of the trajectories are constructed from equilibrium conformations in 
CG models, and the weights of the trajectories can be estimated from the trajectories themselves 
in generating complete equilibrium samples in the original systems. CA calculates the directed 
distances among the trajectories and groups their initial conformations into some clusters, which 
correspond to stable states in the original systems, so that transition dynamics can be detected 
without requiring a priori knowledge of the states. 

PACS numbers: 02.70.Ns, 82.20.Wt 
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Atomistic molecular dynamics (MD) simulations are 
very powerful tools to accurately evaluate dynamics and 
thermodynamics properties of complex systems, however 
they are limited to systems with small size and phe- 
nomena of short time. Recently, many different mul- 
tiscale techniques are developed to extend the tempo- 
ral/spatial scales of simulations [J, H, H, 0]. Coarse- 
grained (CG) modeling, which reduces degrees of freedom 
and parameterizes effective interactions, offers apromis- 
ing way to surmount the limitations [3, d, H 0> Hi- 
The effective interaction of CG models, such as U(x), 
is usually required to match the free energy surface, 
F(x) — — In J e~ v ( r )5(x — x(r))dr, in whole the CG con- 
formational space, x. Here x — x(r) are conformational 
functions in the original system with potential energy 
surface, V(r). §(■■■) is the Dirac-<5 function, and fceT 
is set as the unit of energy. In CG approaches, some as- 
sumptions and approximations are inevitably introduced, 
because it is very difficult (if not impossible) to get an 
analyzed F(x) in the high-dimension space, x. In the 
other hand, some interesting properties, such as transi- 
tion dynamics, may be changed in the CG approaching. 
It is important to high efficiently construct CG models 
and to refine the CG models while it is necessary. 

A key of CG approaches is to define a cheap and ac- 
curate distance between CG models, such as U(x), and 
original systems, such as V(r), or the corresponding free 
energies, F(x). In traditional CG approaches [5[, the dis- 
tance is defined from ensemble means of some (arbitrar- 
ily) selected variables in the two systems. For example, 
ones calculate the difference of radial distribution func- 
tion g(z) in U(x) and Vir) and define the distance as 
Dtrad = J dzp(z)[(g(z)}u - (g{z)) v ? @, @. Here p(z) 



is an optional weight, and 



are ensemble means. 



conformations, such as x l ,i = I, • • • , M, then define the 
distance as the mean of SV(x) — F(x) — U(x) or its 
derivative in the sample, i.e., Dee — jj Yl[^( xi )] 2 or 
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In more recent works 0, Hj], ones directly estimate val- 
ues of F(x) or its gradients §^ at some sampled CG 



] 2 xl . While D FE or D F ed takes 
into account the overall characteristic of F(x), the cal- 
culation of F(x l ) or §~ is usually very time-consuming. 
In contrast, the traditional CG approaches are easier to 
be calculated but effects due to the arbitrary selection of 
variables are not very clear. 

In this letter, we first define a directed distance be- 
tween any two systems based on ensemble means of a 
complete basis function set, then estimate the distance 
by calculating the means of some interesting observable 
in large conformational samples of the two systems. Thus 
we can efficiently and accurately optimize parameters of 
effective interactions of CG models by minimizing the 
directed distance. Furthermore, we present weighted en- 
semble dynamics (WED) simulations and cluster anal- 
yses to refine CG models to exactly reproduce dynam- 
ics and thermodynamics in original systems. WED ran- 
domly CG conformations and arbitrarily adds into the 
missing degrees of freedom with short relaxations to form 
initial conformations, then independently generates mul- 
tiple short molecular dynamics simulations in the original 
systems. Besides statistically detecting ensemble dynam- 
ics in the original systems, WED reproduces the equi- 
librium properties by weighting these trajectories. The 
weights, which are independent on the short relaxation 
simulations, can be estimated from the trajectories them- 
selves, or from a self-consistent equation based on cluster 
analyses (CA) of the trajectories. Without requiring a 
priori knowledge of stable states, we calculate directed 
distances among the trajectories and group their initial 
conformations into clusters {i.e. stable states), and iden- 
tify transition trajectories among the stable states to de- 
tect the corresponding transition dynamics. The CG- 
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WED-CA scheme provides a complete way in analyzing 
stable states, detecting transition dynamics, as well as 
enhanced sampling in complex systems. 

A natural definition of the distance between two po- 
tentials, such as U(x) and F(x), may be their overlap, 
d(U,F) = (<t>u(x)\<t> F (x)). Here <p F (x) oc e~ F ^/ 2 and 
4>u{x) oc e~ utyX ^ 2 have already been normalized. How- 
ever, it is difficult to use the overlap to parameterize 
effective potentials of CG models. Alternately, we de- 
fine a directed distance s FU = ([SfWf,u(x)] 2 )f, where 
the weight function Wf,u(x) oc e F( - x ^ u(x \ 5fA(x) = 
A(x) — (A(x))p, and (Wf,u(x))f = 1 without losing any 
generality. For any variable A(x), 



\(A(x))u ~ (A{x)) F \ < a s F ,u, 



(1) 



where a is the fluctuation of A(x) in the F system. 
Thus sf,u measures the deviation of U{x) from F(x), 
since it provides a upper limit of errors in calculating en- 
semble means of any thermodynamical variable. While 
SfWf,u{x) <C 1, the directed distance is equivalent to 
the overlap, d(U, F) . 

We expand Wf,u(x) in an arbitrary complete basis set, 
such as, {A fl (x)}, 



W F ,u(x) = l+g^(F) (SfA^(x))u 5 f A»{x) 



(2) 



where g^^F) is the inverse matrix of the variance- 
covariance matrix of basis functions, g^ v (F) = 
(8fA^{x) 8fA u (x))f- Here we used the Einstein sum- 
mation notation. Thus, 



>F,t7 



g^a^cf 



(3) 



where g M „ and a M = {A^)u — {A^)f are simple notations 
of g^ v {F) and (SfA' 1 (x))u, respectively. In this letter, 
we generally use U (x) to denote effective interactions of 
CG models, V(r) to original systems, and F(x) to free 
energy surfaces of V(r) in the x space. We also denote 
the missing degrees of freedom in the CG approaches as 
y, so that (x,y) — r. In principle, eq.(H]) is exact and 
independent on the applied U(x). An analyzed free en- 
ergy F(x) in any high-dimension space can be obtained 
by calculating the ensemble means of basis functions in 
V(r) and in an arbitrary U{x). In practice, because the 
ensemble means are estimated in finite-size samples, and 
a finite subset of the basis set is applied instead of whole 
the basis set, only an approximated F(x), is obtained 
in the expansion. However, the directed distance sf,u 
provides a good approximation of the deviation of U (x) 
from V(r), if most interesting observable are included in 
the basis set. Based on the directed distance, CG models 
are constructed as below: (1) sample M conformations in 
V(r) (or in a reference system if the simulations in V(r) 
is too expensive), and calculate the means and variance- 
covariance matrix of chosen basis functions in the confor- 
mational sample. Here each basis function corresponds 



to a M— dimension vector, thus the number of indepen- 
dent basis functions is not more than M. The applied 
basis vectors can be orthogonalized and normalized to 
make g^ v be the unit matrix; (2) set initial values of 
parameters of U{x); (3) generate CG conformations in 
the current U{x) and calculate sf.u (and its derivative 
to the parameters of U (x) , if it is needed) ; (4) optimize 
the parameters of U(x) by minimizing sf,u m some stan- 
dard techniques, e.g., the conjugate gradient method. In 
eq.([3]), interesting and important observable should be 
included in the basis set, so that the formed CG model 
at least reproduce the observable very well. In compari- 
son with Dtrad in the traditional CG approach [H, Q and 
Dfe in the free-energy-based CG methods [H, Sf,u takes 
into account the pair correlation among the selected basis 
functions to capture overall characteristics of F(x). 

We further refine the formed CG models by devel- 
oping ensemble dynamics techniques 0, [l(J. Instead 
of the normal long single-trajectory simulations, ensem- 
ble dynamics simulations generate independently multi- 
ple short molecular dynamics trajectories in distribut- 
ing computers and statistically analyze dynamical be- 
haviors of systems. For example, Pande et al. gener- 
ated hundreds of thousand nanosecond-scale trajectories 
and found a few microsecond-scale folding events Il( in 
all-atomic protein models with explicit water molecules. 
However, the ensemble dynamics usually arbitrarily se- 
lects initial conformations of the simulations in known 
states (e.g., the folded and unfolded states of proteins) to 
identify particular transitions within the total simulation 
time scale, the distribution of the conformations collected 
from all the trajectories may be unknown. We present 
weighted ensemble dynamics (WED) simulations which 
independently generates trajectories same as the normal 
ensemble dynamics, but the initial conformations of the 
trajectories are constructed from equilibrium conforma- 
tions in U(x), by arbitrarily adding the missing degrees 
of freedom, y, with short relaxation simulations in V(r). 
The trajectories contribute to equilibrium properties of 
V(r), with weights, {wi}, where 



W; 



t-Aj 



Wv,u(ri(T))dT. 



(4) 



Here W VtU {r) oc e v(r)-u(x(r)) ^ and t ig thc length of tra _ 
jectories. Aj > is selected so that the obtained w.^ 
almost does not changes as varying A^. 

Let us verify WED by considering the ensemble of all 
MD trajectories with length t. Since the trajectories 
are same generated from the Newtonian (or Langevin) 
equation of motion, we have, (1) conformations in a tra- 
jectory have the same weight in contributing to equi- 
librium properties; (2) trajectories started from an ini- 
tial conformation have the same weight in the contri- 
bution, if the initial velocities are unbiasedly formed 
from the Maxwell velocity distribution. We denote the 
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sub-ensemble of the trajectories which started from an 
initial conformation, such as ro, as [ro;i], and denote 
the mean of any A(r) in the sub-ensemble as (A)\ ro .q. 
While t is not very short, the conformational distribu- 
tions of sub-ensembles of the trajectories, which started 
from neighboring conformations, such as [r ; t] and [r' ; t], 
may be identical, (i.e., (A) [ra . t] = (A)y Q . t ] for any A(r)). 
In other words, these initial conformations are equiv- 
alent in the t— length MD simulations, they belong to 
the same stable conformational region, wherein the MD 
simulations easily reach equilibrium within t. Here the 
conformational distribution in the sub-ensemble [ro;t], 
p [r ;t](r) = j J G(r ,0;r,t')dt', and G(r ,0;r,t) is the 
propagator of the applied MD algorithm (i.e., simula- 
tor). Therefore, we strictly define stable conformational 
regions in any time t, without requiring to analyze if the 
time scales of the simulation dynamics are separated well, 
or if some particular trajectories already happened tran- 
sitions. We conclude that all the trajectories started from 
a stable conformational region have the same weight in 
contributing to equilibrium properties, but trajectories 
started from different stable regions might have different 
weights. The total weights of the trajectories from a sta- 
ble region should be proportional to the free energy of the 
stable region. It is worthy mentioning the stable states 
are not only dependent on the length of trajectories, t, 
they are also dependent on the applied simulator itself. 
In other words, the stable states are dependent on the 
applied propagator in the simulations. It might provides 
a way in detecting the possible effects of thermostats in 
canonical-ensemble MD simulations. 

For any (small) t, ones can start from conformations 
sampled in a model, such as U(r), to independently 
generate t— length MD trajectories in V(r). The tra- 
jectory from an initial conformation, such as r§, con- 
tributes to equilibrium properties of V(r) with the weight 
Wk = yVu,v(ro). Here U(r) is a cheaper and smoother 
approximation of V(r) with the same resolution, r. For 
example, U(r) is an all- atomic force field with a higher 
temperature while V(r) is the ab initio interaction with 
a lower temperature. It is a rather different challenge 
while a CG model, U(x), is applied as the starting point, 
since initial conformations of MD trajectories must be 
constructed by subtly adding into the missing degrees of 
freedom, y, so that the distribution of the formed con- 
formations is known, thus the weights of the trajectories 
can be estimated. The subtle construction dominantly 
determines the accuracies and efficiencies of the current 
CG-based enhanced sampling methods, such as the reso- 
lution exchange method [13j. However, as our discussion 
above, while t is not very short, the initial conformations, 
{ro}, are grouped into stable regions, the weights of all 
the trajectories started from the same region can be set 
as a constant, such as, Wk = w[a] if r§ € a. Here a in- 
dexes the stable regions. It is an important improvement 
to replace the weights yVu,v( r o) with w[a], since the lat- 



ter does not change while the initial conformations are 
shortly relaxed. Therefore, we can arbitrarily add the 
missing degrees of freedom, y, into CG conformations 
sampled in U(x) to form some {r fc }, then we (minimize 
V(r) a few steps by constraining the CG variable, x, if it 
is necessary, and) run short (in comparison with t) nor- 
mal MD simulations in V(r) to relax these {r k } to {r k } 
to remove the possibly interatomic overlap. Finally, we 
independently generate multiple t— length trajectories in 
V(r) started from {r k }. Although the short relaxations 
makes the distribution of the initial conformations be un- 
known, it does not change the stable regions which the 
conformations belong to. Here it is possible that confor- 
mations with same x but different y belong to different 
stable regions, thus trajectories from these conformations 
may have different weights. We can analyze and group 
{r k } into some regions, and estimate ensemble means as, 

/4 ,_E a M«]E,> 6a ^l 

{A} ~ J2 a n a w[a] ' (5) 

where n a is the number of these r k inside the a th region, 
and ^4[r^;£] is the mean of any A(r) in the trajector- 
ies) started from r k . Here w[a] oc n" 1 J a e~ v ^dr. A 
single trajectory (or multiple trajectories for getting bet- 
ter statistics) from an initial conformation, such as r„, 
is applied to represent the sub-ensemble of trajectories, 
[V*;t]. It is possible to estimate w[a) from the average 
of Wu,v( r ) m t ne a region if n a is sufficient large. How- 
ever, we can more efficiently estimate w[a] by supposing 
each t— length MD trajectory reaches the local equilib- 
rium in the stable region whose initial conformation be- 
longs to. Although a small fraction of trajectories might 
transition out of their initial stable regions, the part of 
such a trajectory before the transition is still supposed 
to be long enough to reach the local equilibrium. Thus, 
instead of identifying stable regions of these initial con- 
formations, the weights of trajectories (or more exactly, 
of sub-ensembles), {iVi}, can be directly estimated from 
eq.(|l]). If a trajectory happens a transition, only the first 
part before the transition is applied in the estimate by 
using a positive Aj in eq.([4]). Here multiple trajecto- 
ries from an initial conformation can be generated to get 
better estimate of the weight of the sub-ensemble, if it 
is necessary. Ensemble mean of any A(r) is estimated as 

(A) = E fc w k HlrH;t] 

The initial conformations of trajectories can be 
grouped into clusters by calculating the directed dis- 
tances among the trajectories. Each cluster, wherein the 
distances are smaller than a chosen threshold value, cor- 
responds a stable region of the system in the time t and in 
the applied simulator. Some of these trajectories might 
be supposed to happen transitions, for example, if they 
large deviate from the other states, or if a positive A 
is judged to apply in the calculation of u>i from eq. ([!]). 
We can calculate the distance of the two ending parts 
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of these trajectories from the other states to detect the 
corresponding transitions. Generally, the deviation of a 
single conformation, r k , from a sample X, is 



4 rt =vW SxA^r*) 5 x A»{r k ). 



(6) 



Here 8xA^{r k ) is the difference of the value of A^(r) at 
r k from its mean in the sample X. Then the deviation of 
multiple independent conformations, {r k }, k — 1, ■ • ■ , to, 
from X is, s 2 x ^ rk , = J2 k s 2 v _ h . Due to the finite sizes 



of samples, s 



X,{r k } 



ik *X,r k - 

is in the order of s 2 



n(- + ti) 

v m M I 

rather than zero, even while the distribution of {r k } is 
same as that of X. Here n is the number of the applied 
basis functions, and M is the size of X. Thus, s 2 provides 
a reference in the cluster analyses of initial conforma- 
tions. Therefore, without requiring a priori knowledge of 
stable states, the cluster analyses (CA) forms a network 
of stable states in original systems. The free energies of 
the stable states can be estimated from the weights of tra- 
jectories which started from the states, without requiring 
to know conformational boundaries of the states. 

In estimating weights of trajectories (or of sub- 
ensembles) from eqQ, the relaxation of initial conforma- 
tions should be short in comparison with the length of 
each trajectory, i, so that the relaxation does not change 
the stable regions of these initial conformations. It is 
possible to generate trajectories from arbitrary initial 
conformations and estimate their weight. Consider an 
arbitrary conformational sample, X = {r k }, we gener- 
ate trajectories from {r k }, and group the conformations 
into stable regions. Although the distribution of r k is 
unknown, the weight function yVx,v(r) can be defined 
and be expanded based on eq.([2]) with some unknown 
variables, w[a], a = 1, 2, • • •. In the other hand, w[a] can 
be written as the average of Wx,v(r k ) in the a th region, 
thus 



= 1 



Es npwl/3] 



(7) 



where T al3 = g^(X) A^(r k )[a} A"[r*;i]. 

A^(r k )[a] = Erjea A ^( r a)' and n a is the number of 
the initial conformations in the a region. Here the mean 
of A^ in X was already set as zero. While the size of X 
is large, and n a in each a region is large, w[a] can be 
estimated well from eq.([7]), no matter how these initial 
conformations are constructed. 

The CG-WED-CA approach provides a scheme in an- 
alyzing stable conformational regions and ensemble dy- 
namics within the total simulation time scale, as well as 
in enhanced sampling in complex systems, such as biolog- 
ical macromolecules: (i) construct CG models and gener- 
ate complete CG samples; (ii) construct initial conforma- 
tions in original systems to start WED simulations and 



calculate weights of the MD trajectories; (iii) analyze sta- 
ble states of the initial conformations ( and may compare 
with available experimental results, such as, native folded 
states and typical partially folded states in proteins); (iv) 
statistically detect transitions among the states from the 
trajectories. In enhanced sampling, the scheme is much 
flexible and efficient in comparison with the previous 
methods, such as replica exchange method [12J or its de- 
velopment, the resolution exchange method [13j ]. where 
either many replicas are needed or the extra degrees of 
freedom must be subtly added so that the formed confor- 
mations satisfy a known distribution. The scheme also 
provides a start point to detect dynamics in longer time 
scale than the total simulation time, based on the slow 
dynamics techniques between two known ends, such as 
the transition path sampling [3]. Ones also might find 
a controllable way to modify the probability of generat- 
ing trajectories in the trajectory spaces so that the slow 
transition trajectories are more focused on. 
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